ABSTRACT

Recent work in signal processing and statistics has focused on defining new regularization functions, which not
only induce sparsity of the solution, but also take into account the structure of the problem. 1-7 We present in
this paper a class of convex penalties introduced in the machine learning community, which take the form of a
sum of $\ell_2$- and $\ell_\infty$-norms over groups of variables. They extend the classical group-sparsity regularization 8-10 in
the sense that the groups possibly overlap, allowing more flexibility in the group design. We review efficient
optimization methods to deal with the corresponding inverse problems, 11-13 and their application to the problem of
learning dictionaries of natural image patches. 14-18 On the one hand, dictionary learning has indeed proven
effective for various signal processing tasks. 17,19 On the other hand, structured sparsity provides a natural framework
for modeling dependencies between dictionary elements. We thus consider a structured sparse regularization to
learn dictionaries embedded in a particular structure, for instance a tree 11 or a two-dimensional grid. 20 In the
latter case, the results we obtain are similar to the dictionaries produced by topographic independent component
analysis. 21
1. INTRODUCTION



Sparse representations have recently drawn much interest in signal, image, and video processing. Under the 
assumption that natural images admit a sparse decomposition in some redundant basis (or so-called dictionary), 
several such models have been proposed, e.g., curvelets, 22 wedgelets, 23 bandlets 24 and more generally various sorts 
of wavelets. 25 Learned sparse image models were first introduced in the neuroscience community by Olshausen 
and Field 14, 15 for modeling the spatial receptive fields of simple cells in the mammalian visual cortex. The linear
decomposition of a signal using a few atoms of a learned dictionary, instead of predefined ones, has recently
led to state-of-the-art results for numerous low-level image processing tasks such as denoising, inpainting 17,19,26
or texture synthesis, 27 showing that sparse models are well adapted to natural images. Unlike decompositions
based on principal component analysis, these models can rely on overcomplete dictionaries, with a number of 
atoms greater than the original dimension of the signals, allowing more flexibility to adapt the representation to 
the data. 

In addition to this recent interest from the signal and image processing communities in sparse modelling,
statisticians have developed similar tools from a different point of view. In signal processing, one often represents
a data vector $\mathbf{y}$ of fixed dimension $m$ as a linear combination of $p$ dictionary elements $\mathbf{D} = [\mathbf{d}^1, \ldots, \mathbf{d}^p]$ in $\mathbb{R}^{m \times p}$.
In other words, one looks for a vector $\boldsymbol{\alpha}$ in $\mathbb{R}^p$ such that $\mathbf{y} \approx \mathbf{D}\boldsymbol{\alpha}$. When we assume $\boldsymbol{\alpha}$ to be sparse, that is, to have
many zero coefficients, we obtain a sparse linear model and need appropriate regularization functions. When $\mathbf{D}$
is fixed, the columns $\mathbf{d}^i$ can be interpreted as the elements of a redundant basis, for instance wavelets. 25

Let us now consider a different problem occurring in statistics or machine learning. Given a training
set $(y^i, \mathbf{x}^i)_{i=1}^m$, where the $y^i$'s are scalars and the $\mathbf{x}^i$'s are vectors in $\mathbb{R}^p$, the task is to predict a value for $y$
from an observation $\mathbf{x}$ in $\mathbb{R}^p$. This is usually achieved by learning a model from the training data, and the
simplest one is to assume that there exists a linear relationship $y \approx \mathbf{x}^\top \mathbf{w}$, where $\mathbf{w}$ is a vector in $\mathbb{R}^p$. Learning
the model amounts to adapting $\mathbf{w}$ to the training set. Denoting by $\mathbf{y}$ the vector in $\mathbb{R}^m$ whose entries are the
$y^i$'s, and by $\mathbf{X}$ the matrix in $\mathbb{R}^{m \times p}$ whose rows are the $\mathbf{x}^i$'s, we end up looking for a vector $\mathbf{w}$ such
that $\mathbf{y} \approx \mathbf{X}\mathbf{w}$. When one knows in advance that the vector $\mathbf{w}$ is sparse, a problem similar to the one in signal
processing arises, where $\mathbf{X}$ can be interpreted as a "dictionary" but is often called a set of "features" or "predictors".
It is therefore not surprising that both communities have developed similar tools: the Lasso formulation, 28
the L2-boosting algorithm, 29 and forward selection techniques 30 in statistics are respectively equivalent (up to minor
details) to the basis pursuit problem, 31 matching pursuit, and variants of orthogonal matching pursuit algorithms. 32

Formally, the sparse decomposition problem of a signal $\mathbf{y}$ using a dictionary $\mathbf{D}$ amounts to finding a vector $\boldsymbol{\alpha}$
minimizing the following cost function

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^p} \frac{1}{2}\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda\psi(\boldsymbol{\alpha}), \tag{1}$$

where $\psi$ is a sparsity-inducing function, and $\lambda$ a regularization parameter. A natural choice is to use the $\ell_0$
quasi-norm, which counts the number of non-zero elements in a vector, leading however to an NP-hard problem, 33
which is usually tackled with greedy algorithms. 32 Another approach consists of using a convex relaxation such
as the $\ell_1$-norm. Indeed, it is well known that the $\ell_1$ penalty yields a sparse solution, but there is no analytic link
between the value of $\lambda$ and the effective sparsity $\|\boldsymbol{\alpha}\|_0$ that it yields.

We consider in this paper recent sparsity-inducing penalties capable of encoding the structure of a signal
decomposition on a redundant basis. The $\ell_1$-norm primarily encourages sparse solutions, regardless of the
potential structural relationships (e.g., spatial, temporal or hierarchical) existing between the variables. To cope
with that issue, some effort has recently been devoted to designing sparsity-inducing regularizations capable
of encoding higher-order information about the patterns of non-zero coefficients, some of these works coming
from the machine learning/statistics literature, 1-4 others from signal processing. 5 We use here the approach of
Jenatton et al. 2 who consider sums of norms of appropriate subsets, or groups, of variables, in order to control
the sparsity patterns of the solutions. The underlying optimization is usually difficult, in part because it involves
nonsmooth components. We review strategies to address these problems, first when the groups are embedded in
a tree, 1,11 second in a general setting. 13

Whereas these penalties have been shown to be useful for solving various problems in computer vision, bio- 
informatics, or neuroscience, 1-4 we address here the problem of learning dictionaries of natural image patches 
which exhibit particular relationships among their elements. Such a construction is motivated a priori by two 
distinct but related goals: first to potentially improve the performance of denoising, inpainting or other signal 
processing tasks that can be tackled based on the learned dictionaries, and second to uncover or reveal some 
of the natural structures present in images. In previous work, 11 we have for instance embedded dictionary
elements into a tree, by using a hierarchical norm. 1 This model encodes a rule saying that a dictionary element
can be used in the decomposition of a signal only if its ancestors in the tree are used as well, as
in the zerotree wavelet model. 34 In the related context of independent component analysis (ICA), Hyvärinen
et al. 21 have arranged independent components (corresponding to dictionary elements) on a two-dimensional
grid, and have modelled spatial dependencies between them. When learned on whitened natural image patches,
this model exhibits "Gabor-like" functions which are smoothly organized on the grid, which the authors call a
topographic map. As shown in Ref. 20, such a result can be reproduced with a dictionary learning formulation
using structured regularization.

We use the following notation in the paper: Vectors are denoted by bold lower case letters and matrices by
upper case ones. We define for $q \geq 1$ the $\ell_q$-norm of a vector $\mathbf{x}$ in $\mathbb{R}^m$ as $\|\mathbf{x}\|_q = (\sum_{i=1}^m |x_i|^q)^{1/q}$, where $x_i$ denotes
the $i$-th coordinate of $\mathbf{x}$, and $\|\mathbf{x}\|_\infty = \max_{i=1,\ldots,m} |x_i| = \lim_{q \to \infty} \|\mathbf{x}\|_q$. We also define the $\ell_0$-pseudo-norm as
the number of nonzero elements in a vector:* $\|\mathbf{x}\|_0 = \#\{i \text{ s.t. } x_i \neq 0\} = \lim_{q \to 0^+} (\sum_{i=1}^m |x_i|^q)$. We consider the
Frobenius norm of a matrix $\mathbf{X}$ in $\mathbb{R}^{m \times n}$: $\|\mathbf{X}\|_F = (\sum_{i=1}^m \sum_{j=1}^n X_{ij}^2)^{1/2}$, where $X_{ij}$ denotes the entry of $\mathbf{X}$ at row
$i$ and column $j$.

*Note that it would be more proper to write $\|\mathbf{x}\|_0^0$ instead of $\|\mathbf{x}\|_0$ to be consistent with the traditional notation $\|\mathbf{x}\|_q$.
However, for the sake of simplicity, we will keep this notation unchanged in the rest of the paper.






This paper is structured as follows: Section 2 reviews the dictionary learning and structured sparsity frame- 
works, Section 3 is devoted to optimization techniques, and Section 4 to experiments with structured dictionary 
learning. Note that the material of this paper relies upon two of our papers published in the Journal of Machine 
Learning Research. 11, 13 



2. RELATED WORK 

We present in this section the dictionary learning framework and structured sparsity-inducing regularization 
functions. 



2.1 Dictionary Learning 

Consider a signal $\mathbf{y}$ in $\mathbb{R}^m$. We say that $\mathbf{y}$ admits a sparse approximation over a dictionary $\mathbf{D}$ in $\mathbb{R}^{m \times p}$, composed
of $p$ elements (atoms), when we can find a linear combination of a "few" atoms from $\mathbf{D}$ that is "close" to the
original signal $\mathbf{y}$. A number of practical algorithms have been developed for learning such dictionaries, like the
K-SVD algorithm, 35 the method of optimal directions (MOD), 16 stochastic gradient descent algorithms 14 or
other online learning techniques, 18 which will be briefly reviewed in Section 3. This approach has led to several
restoration algorithms, with state-of-the-art results in image and video denoising, inpainting, demosaicing, 17,19
and texture synthesis. 27

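Given a training set of signals $\mathbf{Y} = [\mathbf{y}^1, \ldots, \mathbf{y}^n]$ in $\mathbb{R}^{m \times n}$, such as natural image patches, dictionary learning
amounts to finding a dictionary which is adapted to every signal $\mathbf{y}^i$. In other words, it can be cast as the following
optimization problem

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \sum_{i=1}^n \left[\frac{1}{2}\|\mathbf{y}^i - \mathbf{D}\boldsymbol{\alpha}^i\|_2^2 + \lambda\psi(\boldsymbol{\alpha}^i)\right], \tag{2}$$

where $\mathbf{A} = [\boldsymbol{\alpha}^1, \ldots, \boldsymbol{\alpha}^n]$ are decomposition coefficients, $\psi$ is a sparsity-inducing penalty, and $\mathcal{C}$ is a constraint
set, typically the set of matrices whose columns have $\ell_2$-norm at most one:

$$\mathcal{C} = \{\mathbf{D} \in \mathbb{R}^{m \times p} : \forall j = 1, \ldots, p,\ \|\mathbf{d}^j\|_2 \leq 1\}. \tag{3}$$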

To prevent $\mathbf{D}$ from being arbitrarily large (which would lead to arbitrarily small values of $\boldsymbol{\alpha}$), it is indeed
necessary to constrain the dictionary with such a set $\mathcal{C}$. We also remark that dictionary learning is an instance
of a matrix factorization problem, which can be equivalently rewritten

$$\min_{\mathbf{D} \in \mathcal{C},\, \mathbf{A} \in \mathbb{R}^{p \times n}} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda\psi'(\mathbf{A}), \tag{4}$$

with an appropriate function $\psi'$. Noticing this interpretation of dictionary learning as a matrix factorization
has a number of practical consequences. With adequate constraints on $\mathbf{A}$ and $\mathbf{D}$, one can indeed recast several
classical problems as regularized matrix factorization problems, for instance principal component analysis (PCA),
non-negative matrix factorization (NMF), 36 hard and soft vector quantization (VQ). As a first consequence, all
of these approaches can be addressed with similar algorithms, as shown in Ref. 18. A natural approach to
approximately solve this non-convex problem is for instance to alternate between the optimization of $\mathbf{D}$ and $\mathbf{A}$ in
Eq. (4), minimizing over one while keeping the other one fixed, 16 a technique also used in the K-means algorithm
for vector quantization.

Another approach consists of using stochastic approximations and online learning algorithms. When $n$
is large, finding the sparse coefficients $\mathbf{A}$ with a fixed dictionary $\mathbf{D}$ requires solving $n$ sparse decomposition
problems (1), which can be cumbersome. To cope with this issue, online learning techniques adopt a different
iterative algorithmic scheme: at iteration $t$, they randomly draw one signal $\mathbf{y}^t$ from the training set (or a
mini-batch), and try to "improve" $\mathbf{D}$ given this observation. Assume indeed that $n$ is large and that the image
patches $\mathbf{y}^i$ are i.i.d. samples drawn from an unknown distribution $p(\mathbf{y})$; then Eq. (2) is asymptotically equivalent
to

$$\min_{\mathbf{D} \in \mathcal{C}} \mathbb{E}_{\mathbf{y} \sim p(\mathbf{y})}\left[\min_{\boldsymbol{\alpha} \in \mathbb{R}^p} \frac{1}{2}\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda\psi(\boldsymbol{\alpha})\right]. \tag{5}$$

In order to optimize a cost function which includes an expectation, it is natural to use stochastic approximations. 37
When $\psi$ is the $\ell_1$-norm, this problem is also differentiable under mild assumptions (see Mairal et al. 18 for more
details), and a first-order stochastic gradient descent step, 15,18 given a signal $\mathbf{y}^t$, can be written:

$$\mathbf{D} \leftarrow \Pi_{\mathcal{C}}\left[\mathbf{D} + \delta_t (\mathbf{y}^t - \mathbf{D}\boldsymbol{\alpha}^t)\boldsymbol{\alpha}^{t\top}\right], \tag{6}$$

where $\delta_t$ is the gradient step and $\Pi_{\mathcal{C}}$ is the orthogonal projector onto $\mathcal{C}$. The vector $\boldsymbol{\alpha}^t$ carries the sparse coefficients
obtained from the decomposition of $\mathbf{y}^t$ with the current dictionary $\mathbf{D}$. When $\psi$ is the $\ell_0$-pseudo-norm, this iteration
is heuristic but gives good results in practice. When $\psi$ is the $\ell_1$-norm, and assuming the solution of the sparse
decomposition problem to be unique, this iteration exactly corresponds to a stochastic gradient descent
algorithm. 37 Note that the vectors $\mathbf{y}^t$ are assumed to be i.i.d. samples of the (unknown) distribution $p(\mathbf{y})$. Since
it is often difficult to obtain such i.i.d. samples, the vectors $\mathbf{y}^t$ are in practice obtained by cycling on a
randomly permuted training set. The main difficulty in this approach is to choose a good learning rate $\delta_t$. Other
dedicated online learning algorithms have been proposed, 18 which can be shown to provide a stationary point of
the optimization problem (5). All of these online learning techniques have been shown to yield significant speed-ups
over the classical alternate minimization approach when $n$ is large enough.
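
As an illustration of the projected stochastic gradient update of Eq. (6), a minimal NumPy sketch could read as follows. The function names, the random data and the hand-picked sparse code are hypothetical and not taken from Refs. 15 or 18; in practice the code would come from a sparse solver, and the step size would follow a decreasing schedule.

```python
import numpy as np

def project_columns_unit_ball(D):
    """Orthogonal projection onto C: rescale columns whose l2-norm exceeds one."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

def sgd_dictionary_step(D, y, alpha, step):
    """One update D <- Pi_C[D + step * (y - D alpha) alpha^T] for a single signal."""
    residual = y - D @ alpha
    return project_columns_unit_ball(D + step * np.outer(residual, alpha))

# Toy data (assumed): m = 4, p = 6, and a sparse code with two active atoms.
rng = np.random.default_rng(0)
D = project_columns_unit_ball(rng.standard_normal((4, 6)))
y = rng.standard_normal(4)
alpha = np.zeros(6)
alpha[[1, 4]] = [0.5, -1.0]
D_new = sgd_dictionary_step(D, y, alpha, step=0.1)
```

Only the atoms with a non-zero coefficient are modified by the gradient term, so a sparse code touches few columns of the dictionary at each iteration.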

Examples of dictionaries learned using the approach of Mairal et al. 18 are represented in Figure 1, and exhibit 
intriguing visual results. Some of the dictionary elements look like Gabor wavelets, whereas other elements are 
more difficult to interpret. As for the color image patches, we observe that most of the dictionary elements 
are gray, with a few low-frequency colored elements exhibiting complementary colors, a phenomenon already 
observed in image processing applications. 19 
Figure 1. Examples of dictionaries with $p = 256$ elements, learned on a database of 10 million natural 12 x 12 image patches
when $\psi$ is the $\ell_1$-norm, for grayscale patches on the left, and color patches on the right (after removing the mean color of
each patch). Image taken from Ref. 38.



2.2 Structured Sparsity 

We consider again the sparse decomposition problem presented in Eq. (1), but we allow $\psi$ to be different from
the $\ell_0$- or $\ell_1$-regularization, and we are interested in problems where the solution is assumed beforehand not only
to be sparse (that is, to have only a few non-zero coefficients), but also to form non-zero patterns with
a specific structure. It is indeed possible to encode additional knowledge in the regularization other than just
sparsity. For instance, one may want the non-zero patterns to be structured in the form of non-overlapping
groups, 8-10 in a tree, 1,11 or in overlapping groups. 2-7 As for classical non-structured sparse models, there
are basically two lines of research, which either (a) deal with nonconvex and combinatorial formulations that
are in general computationally intractable and addressed with greedy algorithms or (b) concentrate on convex
relaxations solved with convex programming methods. We focus in this paper on the latter.

When the sparse coefficients are organized in groups, a penalty encoding explicitly this prior knowledge can
improve the prediction performance and/or interpretability of the learned models. 9,10 Denoting by $\mathcal{G}$ a set of
groups of indices, such a penalty takes the form:

$$\psi(\boldsymbol{\alpha}) = \sum_{g \in \mathcal{G}} \eta_g \|\boldsymbol{\alpha}_g\|_q, \tag{7}$$

where $\alpha_j$ is the $j$-th entry of $\boldsymbol{\alpha}$ for $j$ in $[1;p] = \{1, \ldots, p\}$, the vector $\boldsymbol{\alpha}_g$ in $\mathbb{R}^{|g|}$ records the coefficients of $\boldsymbol{\alpha}$
indexed by $g$ in $\mathcal{G}$, and the scalars $\eta_g$ are positive weights. $\|.\|_q$ denotes here either the $\ell_2$- or $\ell_\infty$-norm. Note
that when $\mathcal{G}$ is the set of singletons of $[1;p]$, we get back the $\ell_1$-norm. Inside a group, the $\ell_2$- or $\ell_\infty$-norm does
not induce sparsity, whereas the sum over the groups can be interpreted as an $\ell_1$-norm,† and indeed, when $\mathcal{G}$ is a
partition of $[1;p]$, variables are selected in groups rather than individually. When the groups overlap, $\psi$ is still a
norm and sets groups of variables to zero together. 2 The latter setting was first considered for hierarchies, 1
and then extended to general group structures. 2,‡ Solving Eq. (1) in this context becomes challenging and is
the topic of the next section. Before that, in order to better illustrate how such norms should be used and how
to design a group structure inducing a desired sparsifying effect, we proceed by giving a few examples of group
structures.
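
To make the penalty of Eq. (7) concrete, the following minimal NumPy sketch evaluates it for a given set of groups. The function name `group_penalty`, the 0-based indexing and the toy vector are assumptions made for illustration only.

```python
import numpy as np

def group_penalty(alpha, groups, weights=None, q=2):
    """Evaluate psi(alpha) = sum_g eta_g * ||alpha_g||_q, with q = 2 or q = np.inf."""
    if weights is None:
        weights = [1.0] * len(groups)          # eta_g = 1 for every group
    return sum(eta * np.linalg.norm(alpha[list(g)], ord=q)
               for eta, g in zip(weights, groups))

alpha = np.array([3.0, -4.0, 0.0, 1.0])

# Singletons: the penalty reduces to the l1-norm, 3 + 4 + 0 + 1.
l1_value = group_penalty(alpha, [[0], [1], [2], [3]])

# A partition with l2-norms (group Lasso): ||(3, -4)||_2 + ||(0, 1)||_2.
partition_value = group_penalty(alpha, [[0, 1], [2, 3]], q=2)

# Overlapping groups with l_inf-norms: max(3, 4) + max(4, 0, 1).
overlap_value = group_penalty(alpha, [[0, 1], [1, 2, 3]], q=np.inf)
```

The same function covers the non-overlapping and overlapping cases, since the penalty only depends on the list of index sets and on the choice of inner norm.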

2.2.1 One-dimensional Sequence. 

Given $p$ variables organized in a sequence, suppose we want to select only contiguous nonzero patterns. A set of
groups $\mathcal{G}$ exactly producing such patterns is represented in Figure 2. It is indeed easy to show that selecting
a family of groups in $\mathcal{G}$ represented in this figure, and setting the corresponding variables to zero, leads exactly
to contiguous patterns of non-zero coefficients. The penalty (7) with this group structure therefore produces
exactly the desired sparsity patterns.
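
A small sketch can illustrate this construction, under the assumption that the groups of Figure 2 are the prefixes and suffixes of the sequence (our reading of the construction of Ref. 2). The helper names and the 0-based indexing are hypothetical.

```python
def contiguous_groups(p):
    """Prefixes {0..k-1} and suffixes {k..p-1} of a length-p sequence (0-based)."""
    prefixes = [list(range(0, k)) for k in range(1, p)]
    suffixes = [list(range(k, p)) for k in range(1, p)]
    return prefixes + suffixes

def nonzero_pattern(p, zeroed_groups):
    """Indices left nonzero after setting every variable of the chosen groups to zero."""
    zero = set().union(*zeroed_groups)
    return [j for j in range(p) if j not in zero]

def is_contiguous(indices):
    """True when the sorted indices form an interval (the empty pattern counts)."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

groups = contiguous_groups(6)
# Zeroing the prefix {0, 1} and the suffix {4, 5} leaves the interval {2, 3}.
pattern = nonzero_pattern(6, [groups[1], groups[8]])
```

Any union of such groups is itself a prefix together with a suffix, which is why the remaining nonzero indices always form an interval.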




Figure 2. (Left) The set of blue groups to penalize in order to select contiguous patterns in a sequence. (Right) In red,
an example of such a nonzero pattern with its corresponding zero pattern (hatched area). Image taken from Ref. 2.



2.2.2 Hierarchical Norms 

Another example of interest originally comes from the wavelet literature. It consists of modelling hierarchical 
relations between wavelet coefficients, which are naturally organized in a tree, due to the multiscale properties 
of wavelet decompositions. 25 The zero-tree wavelet model 34 indeed assumes that if a wavelet coefficient is set to 
zero, then it should be the case for all its descendants in the tree. This effect can in fact be exactly achieved with 
the convex regularization of Eq. (7), with an appropriate group structure presented in Figure 3. This penalty 
was originally introduced in the statistics community by Zhao et al., and found different applications, notably 
in topic models for text corpora. 11 
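
The group structure behind such hierarchical norms can be sketched as follows, assuming that each group contains a node together with all of its descendants. The 6-node tree below is hypothetical, chosen to be consistent with the caption of Figure 3, and the function name is ours for illustration.

```python
def tree_groups(parent):
    """Map each node to the group made of the node and all of its descendants."""
    groups = {node: {node} for node in parent}
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:      # add `node` to every ancestor's group
            groups[ancestor].add(node)
            ancestor = parent[ancestor]
    return groups

# Assumed tree: node 1 is the root, with children 2 and 3; 4 is a child of 2;
# 5 and 6 are children of 3.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3}
groups = tree_groups(parent)

# Setting the groups rooted at 2, 4 and 6 to zero removes whole subtrees.
zeroed = set().union(groups[2], groups[4], groups[6])
selected = sorted(set(parent) - zeroed)
```

With this construction any two groups are either disjoint or nested, and the selected variables form a connected subtree containing the root.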

†The sum of positive values is equal to the $\ell_1$-norm of a vector carrying these values.

‡Note that other sparsity-inducing norms have been introduced, 3 which are different and not equivalent to the one we
consider in this paper. One should be careful when referring to "structured sparsity penalty with overlapping groups",
since different generalizations of the selection of variables in groups have been proposed.



Figure 3. Left: example of a tree-structured set of groups $\mathcal{G}$ (dashed contours in red), corresponding to a tree $\mathcal{T}$ with
$p = 6$ nodes represented by black circles. Right: example of a sparsity pattern induced by the tree-structured norm
corresponding to $\mathcal{G}$: the groups {2, 4}, {4} and {6} are set to zero, so that the corresponding nodes (in gray) that form
subtrees of $\mathcal{T}$ are removed. The remaining nonzero variables {1, 3, 5} form a rooted and connected subtree of $\mathcal{T}$. This
sparsity pattern obeys the following equivalent rules: (i) if a node is selected, so are all of its ancestors, (ii) if a node is
not selected, then its descendants are not selected. Image taken from Ref. 11.

2.2.3 Neighborhoods on a 2D-Grid 

Another group structure we are going to consider corresponds to the assumption that the dictionary elements
can be organized on a 2D-grid; for example, we might have $p = 20 \times 20$ dictionary elements. To obtain a spatial
regularization effect on the grid, it is possible to use as groups all the $e \times e$ neighborhoods on the grid, for
example $3 \times 3$. The main effect of such a regularization is to encourage variables that are in the same neighborhood
to be set to zero all together. Such a dictionary structure has been used for instance in Ref. 13 for a background
subtraction task (segmenting foreground objects from the background in a video).
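
The neighborhood groups can be enumerated with a few lines of code. The sketch below assumes a cyclic grid, as in the experiments of Section 4.2, and a row-major numbering of the grid cells; the function name is hypothetical.

```python
def grid_neighborhood_groups(side, e):
    """One e x e neighborhood per grid cell, with cyclic (wrap-around) boundaries."""
    groups = []
    for r in range(side):
        for c in range(side):
            # Cell (r, c) is the top-left corner; indices are row-major.
            groups.append([((r + dr) % side) * side + (c + dc) % side
                           for dr in range(e) for dc in range(e)])
    return groups

groups = grid_neighborhood_groups(side=20, e=3)
```

With a cyclic grid there are as many groups as dictionary elements (here 400), and every element belongs to exactly $e^2$ groups, so no position on the grid is favored.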

3. OPTIMIZATION FOR STRUCTURED SPARSITY 

We now present optimization techniques to solve Eq. (1) when $\psi$ is a structured norm (7). This is the main
difficulty to overcome to learn structured dictionaries. We review here the techniques introduced in Refs. 11,13.
More details can be found in these two papers. Other techniques for dealing with sparsity-inducing penalties can
also be found in Ref. 39.

3.1 Proximal Gradient Methods 

In a nutshell, proximal methods can be seen as a natural extension of gradient-based techniques, and they are
well suited to minimizing the sum $f + \lambda\psi$ of two convex terms, a smooth function $f$ (continuously differentiable
with Lipschitz-continuous gradient) and a potentially non-smooth function $\lambda\psi$ (see Refs. 39,40 and references
therein). In our context, the function $f$ takes the form $f(\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2$. At each iteration, the function $f$
is linearized at the current estimate $\boldsymbol{\alpha}_0$ and the so-called proximal problem has to be solved:

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^p} f(\boldsymbol{\alpha}_0) + (\boldsymbol{\alpha} - \boldsymbol{\alpha}_0)^\top \nabla f(\boldsymbol{\alpha}_0) + \lambda\psi(\boldsymbol{\alpha}) + \frac{L}{2}\|\boldsymbol{\alpha} - \boldsymbol{\alpha}_0\|_2^2.$$

The quadratic term keeps the solution in a neighborhood where the current linear approximation holds, and
$L > 0$ is an upper bound on the Lipschitz constant of $\nabla f$. This problem can be rewritten as

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^p} \frac{1}{2}\|\boldsymbol{\beta} - \boldsymbol{\alpha}\|_2^2 + \lambda'\psi(\boldsymbol{\alpha}), \tag{8}$$

with $\lambda' = \lambda/L$, and $\boldsymbol{\beta} = \boldsymbol{\alpha}_0 - \frac{1}{L}\nabla f(\boldsymbol{\alpha}_0)$. We call proximal operator associated with the regularization $\lambda'\psi$
the function that maps a vector $\boldsymbol{\beta}$ in $\mathbb{R}^p$ onto the (unique, by strong convexity) solution $\boldsymbol{\alpha}^\star$ of Eq. (8). Simple
proximal methods use $\boldsymbol{\alpha}^\star$ as the next iterate, but accelerated variants 41,42 are also based on the proximal operator
and require solving problem (8) efficiently to enjoy their fast convergence rates.
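
The simple (non-accelerated) scheme can be sketched in a few lines of NumPy. The version below is an illustration only: it assumes $f(\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2$, takes $L$ as the squared spectral norm of $\mathbf{D}$, and uses the $\ell_1$ soft-thresholding operator as an example of proximal operator; the function names and the random data are hypothetical.

```python
import numpy as np

def soft_threshold(beta, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - tau, 0.0)

def proximal_gradient(D, y, lam, prox, n_iter=200):
    """Minimize 0.5*||y - D a||_2^2 + lam*psi(a), where prox(beta, tau) solves Eq. (8)
    for the regularization tau*psi."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient of f
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - y)         # gradient of f at the current estimate
        alpha = prox(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
y = rng.standard_normal(20)
alpha = proximal_gradient(D, y, lam=0.5, prox=soft_threshold)
```

Any other penalty can be plugged in by passing its proximal operator as `prox`, which is what makes the efficient computation of this operator the central question.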

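This has been shown to be possible in many cases:

• When $\psi$ is the $\ell_1$-norm, that is $\psi(\boldsymbol{\alpha}) = \|\boldsymbol{\alpha}\|_1$, the proximal operator is the well-known elementwise
soft-thresholding operator,

$$\forall j \in [1;p],\quad \beta_j \mapsto \mathrm{sign}(\beta_j)(|\beta_j| - \lambda)_+ = \begin{cases} 0 & \text{if } |\beta_j| \leq \lambda \\ \mathrm{sign}(\beta_j)(|\beta_j| - \lambda) & \text{otherwise.} \end{cases}$$

• When $\psi$ is a group-Lasso penalty with $\ell_2$-norms, that is, $\psi(\boldsymbol{\beta}) = \sum_{g \in \mathcal{G}} \|\boldsymbol{\beta}_g\|_2$, with $\mathcal{G}$ being a partition
of $[1;p]$, the proximal problem is separable in every group, and the solution is a generalization of the
soft-thresholding operator to groups of variables:

$$\forall g \in \mathcal{G},\quad \boldsymbol{\beta}_g \mapsto \boldsymbol{\beta}_g - \Pi_{\|.\|_2 \leq \lambda}[\boldsymbol{\beta}_g] = \begin{cases} 0 & \text{if } \|\boldsymbol{\beta}_g\|_2 \leq \lambda \\ \frac{\|\boldsymbol{\beta}_g\|_2 - \lambda}{\|\boldsymbol{\beta}_g\|_2}\boldsymbol{\beta}_g & \text{otherwise,} \end{cases}$$

where $\Pi_{\|.\|_2 \leq \lambda}$ denotes the orthogonal projection onto the ball of the $\ell_2$-norm of radius $\lambda$.

• When $\psi$ is a group-Lasso penalty with $\ell_\infty$-norms, that is, $\psi(\boldsymbol{\beta}) = \sum_{g \in \mathcal{G}} \|\boldsymbol{\beta}_g\|_\infty$, with $\mathcal{G}$ being a partition
of $[1;p]$, the solution is a different group-thresholding operator:

$$\forall g \in \mathcal{G},\quad \boldsymbol{\beta}_g \mapsto \boldsymbol{\beta}_g - \Pi_{\|.\|_1 \leq \lambda}[\boldsymbol{\beta}_g],$$

where $\Pi_{\|.\|_1 \leq \lambda}$ denotes the orthogonal projection onto the $\ell_1$-ball of radius $\lambda$, which can be computed in $O(p)$
operations. 43,44 Note that when $\|\boldsymbol{\beta}_g\|_1 \leq \lambda$, we have a group-thresholding effect, with $\boldsymbol{\beta}_g - \Pi_{\|.\|_1 \leq \lambda}[\boldsymbol{\beta}_g] = 0$.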

• When $\psi$ is a tree-structured sum of $\ell_2$- or $\ell_\infty$-norms as introduced by Ref. 1, meaning that two groups
are either disjoint or one is included in the other, the solution admits a closed form. Let $\preceq$ be a total order
on $\mathcal{G}$ such that for $g_1, g_2$ in $\mathcal{G}$, $g_1 \preceq g_2$ if and only if either $g_1 \subset g_2$ or $g_1 \cap g_2 = \emptyset$.§ Then, if $g_1 \preceq \ldots \preceq g_{|\mathcal{G}|}$,
and if we define $\mathrm{Prox}^g$ as (a) the proximal operator $\boldsymbol{\beta}_g \mapsto \mathrm{Prox}_{\lambda\eta_g\|.\|}(\boldsymbol{\beta}_g)$ on the subspace corresponding
to group $g$ and (b) the identity on the orthogonal, it is shown in Ref. 11 that:

$$\mathrm{Prox}_{\lambda\psi} = \mathrm{Prox}^{g_{|\mathcal{G}|}} \circ \ldots \circ \mathrm{Prox}^{g_1}, \tag{9}$$

which can be computed in $O(p)$ operations. It also includes the sparse group Lasso (sum of group-Lasso
penalty and $\ell_1$-norm) of Refs. 45 and 46.

• When the groups overlap but do not have a tree structure, computing the proximal operator is more
difficult, but it can still be done efficiently when $q = \infty$. Indeed, as shown by Mairal et al., 12 there exists
a dual relation between such an operator and a quadratic min-cost flow problem on a particular graph,
which can be tackled using network flow optimization techniques. Moreover, it may be extended to more
general situations where structured sparsity is expressed through submodular functions. 47
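
The closed-form cases of the list above can be sketched as follows. This is an illustrative NumPy version with hypothetical function names: the $\ell_1$-ball projection uses a simple sort-based procedure in $O(p \log p)$ rather than the linear-time algorithms of Refs. 43,44, thresholds are assumed strictly positive, and the tree-structured operator assumes that the groups are supplied in an order compatible with Eq. (9), each group coming after all the groups it contains.

```python
import numpy as np

def prox_l2(beta, tau):
    """Group soft-thresholding: proximal operator of tau * ||.||_2."""
    norm = np.linalg.norm(beta)
    return np.zeros_like(beta) if norm <= tau else (1.0 - tau / norm) * beta

def project_l1_ball(v, radius):
    """Orthogonal projection onto {x : ||x||_1 <= radius}, with radius > 0."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)    # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(beta, tau):
    """Proximal operator of tau * ||.||_inf: residual of the l1-ball projection."""
    return beta - project_l1_ball(beta, tau)

def prox_tree(beta, groups, lam, weights, prox=prox_l2):
    """Compose the group-wise operators in the order of Eq. (9) (small groups first)."""
    out = np.array(beta, dtype=float)
    for g, eta in zip(groups, weights):
        idx = list(g)
        out[idx] = prox(out[idx], lam * eta)
    return out

beta = np.array([3.0, -1.0, 0.5])
groups = [[0], [1], [2], [0, 1, 2]]              # leaves first, root group last
result = prox_tree(beta, groups, lam=1.0, weights=[1.0] * 4)
```

In this toy example the singletons followed by the full group correspond to a sparse group Lasso penalty: the singletons first shrink the vector to (2, 0, 0), and the root group then rescales it to (1, 0, 0).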

Mainly using the tools of Refs. 11, 12, we are therefore able to efficiently solve Eq. (1), either in the case of
hierarchical norms with $\ell_2$- or $\ell_\infty$-norms, or with general group structures with $\ell_\infty$-norms. This is one of the
main requirements to be able to learn structured dictionaries. The next section presents a different optimization
technique, adapted to any group structure with $\ell_2$- or $\ell_\infty$-norms.

3.2 Augmented Lagrangian Techniques

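We consider a class of algorithms which leverage the concept of variable splitting. 40, 48-50 The key is to introduce
additional variables $\boldsymbol{\beta}^g$ in $\mathbb{R}^{|g|}$, one for every group $g$ in $\mathcal{G}$, and equivalently reformulate Eq. (1) as

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^p,\ \boldsymbol{\beta}^g \in \mathbb{R}^{|g|} \text{ for } g \in \mathcal{G}} \frac{1}{2}\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \eta_g \|\boldsymbol{\beta}^g\| \quad \text{s.t. } \forall g \in \mathcal{G},\ \boldsymbol{\beta}^g = \boldsymbol{\alpha}_g. \tag{10}$$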



§For a tree-structured set $\mathcal{G}$, such an order exists.



The issue of overlapping groups is removed, but new constraints and variables are added. 

To solve this problem, it is possible to use the so-called alternating direction method of multipliers (ADMM). 40,¶
It introduces dual variables $\boldsymbol{\nu}^g$ in $\mathbb{R}^{|g|}$ for all $g$ in $\mathcal{G}$, and defines the augmented Lagrangian:

$$\mathcal{L}(\boldsymbol{\alpha}, (\boldsymbol{\beta}^g)_{g \in \mathcal{G}}, (\boldsymbol{\nu}^g)_{g \in \mathcal{G}}) \triangleq f(\boldsymbol{\alpha}) + \sum_{g \in \mathcal{G}} \left[\lambda\eta_g\|\boldsymbol{\beta}^g\| + \boldsymbol{\nu}^{g\top}(\boldsymbol{\beta}^g - \boldsymbol{\alpha}_g) + \frac{\gamma}{2}\|\boldsymbol{\beta}^g - \boldsymbol{\alpha}_g\|_2^2\right],$$

where $\gamma > 0$ is a parameter. It is easy to show that solving Eq. (10) amounts to finding a saddle-point of the
augmented Lagrangian.‖ The ADMM algorithm finds such a saddle-point by iterating between the minimization
of $\mathcal{L}$ with respect to each primal variable, keeping the other ones fixed, and gradient ascent steps with respect
to the dual variables. More precisely, it can be summarized as:

1. Minimize $\mathcal{L}$ with respect to $\boldsymbol{\alpha}$, keeping the other variables fixed.

2. Minimize $\mathcal{L}$ with respect to the $\boldsymbol{\beta}^g$'s, keeping the other variables fixed. The solution can be obtained in
closed form: for all $g$ in $\mathcal{G}$, $\boldsymbol{\beta}^g \leftarrow \mathrm{prox}_{\frac{\lambda\eta_g}{\gamma}\|.\|}[\boldsymbol{\alpha}_g - \frac{1}{\gamma}\boldsymbol{\nu}^g]$.

3. Take a gradient ascent step on $\mathcal{L}$ with respect to the $\boldsymbol{\nu}^g$'s: $\boldsymbol{\nu}^g \leftarrow \boldsymbol{\nu}^g + \gamma(\boldsymbol{\beta}^g - \boldsymbol{\alpha}_g)$.

4. Go back to step 1.

Such a procedure is guaranteed to converge to the desired solution for all values of $\gamma > 0$ (however, tuning $\gamma$ can
greatly influence the convergence speed), but solving step 1 efficiently can be difficult. To cope with this issue,
several strategies have been proposed in Ref. 13. For simplicity, we do not provide all the details here and refer
the reader to Ref. 13.
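
The four steps above can be sketched as follows for $\ell_2$-norms. This is a minimal illustration under stated assumptions, not the implementation of Ref. 13: since $f$ is quadratic, step 1 is solved here directly as the linear system $(\mathbf{D}^\top\mathbf{D} + \gamma\,\mathrm{diag}(c))\boldsymbol{\alpha} = \mathbf{D}^\top\mathbf{y} + \sum_g \mathbf{S}_g^\top(\boldsymbol{\nu}^g + \gamma\boldsymbol{\beta}^g)$, where $c_j$ counts the groups containing variable $j$ and $\mathbf{S}_g$ selects the entries of group $g$. This direct solve is only practical for small $p$, and it assumes that every variable belongs to at least one group. All names are hypothetical.

```python
import numpy as np

def prox_l2(beta, tau):
    """Group soft-thresholding: proximal operator of tau * ||.||_2."""
    norm = np.linalg.norm(beta)
    return np.zeros_like(beta) if norm <= tau else (1.0 - tau / norm) * beta

def admm_group_l2(D, y, groups, lam, weights, gamma=1.0, n_iter=300):
    """ADMM sketch for 0.5*||y - D a||_2^2 + lam * sum_g eta_g * ||a_g||_2."""
    p = D.shape[1]
    groups = [np.asarray(g) for g in groups]
    counts = np.zeros(p)
    for g in groups:
        counts[g] += 1.0                         # number of groups containing each variable
    H = D.T @ D + gamma * np.diag(counts)        # fixed system matrix of step 1
    Dty = D.T @ y
    alpha = np.zeros(p)
    betas = [np.zeros(len(g)) for g in groups]
    nus = [np.zeros(len(g)) for g in groups]
    for _ in range(n_iter):
        # Step 1: minimize over alpha (a linear system because f is quadratic).
        rhs = Dty.copy()
        for g, b, v in zip(groups, betas, nus):
            rhs[g] += v + gamma * b
        alpha = np.linalg.solve(H, rhs)
        # Step 2: minimize over each beta^g with a proximal operator.
        betas = [prox_l2(alpha[g] - v / gamma, lam * eta / gamma)
                 for g, v, eta in zip(groups, nus, weights)]
        # Step 3: gradient ascent step on the dual variables.
        nus = [v + gamma * (b - alpha[g]) for g, b, v in zip(groups, betas, nus)]
    return alpha, betas

# Sanity check (assumed toy case): with D = I and non-overlapping groups, the
# solution is the group soft-thresholding of y.
y = np.array([3.0, 4.0, 0.1, 0.1])
alpha, _ = admm_group_l2(np.eye(4), y, groups=[[0, 1], [2, 3]], lam=1.0, weights=[1.0, 1.0])
```

Overlapping groups are handled by the same code, simply by passing index lists that share variables; only the counts on the diagonal of the system matrix change.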

4. EXPERIMENTS WITH STRUCTURED DICTIONARIES 

We present here two experiments from Refs. 11 and 13 on learning structured dictionaries, one with a hierarchical
structure, and one where the dictionary elements are organized on a 2D-grid. 20,21 In both experiments, we
consider the dictionary learning formulation of Eq. (2), with a structured sparsity-inducing regularization for the
function $\psi$.

4.1 Hierarchical Case 

We extracted patches from the Berkeley segmentation database of natural images, 53 which contains a high
diversity of scenes. All the patches are centered (we remove the DC component) and normalized to have unit
$\ell_2$-norm.

We present visual results in Figures 4 and 5, for different patch sizes and different group structures. For
simplicity, the weights $\eta_g$ in Eq. (7) are chosen equal to one, and we choose a penalty $\psi$ which is a sum of
$\ell_\infty$-norms. We solve the sparse decomposition problems (1) using the proximal gradient method of Section 3.1, and
use an alternate minimization scheme to learn the dictionary, as explained in Section 2.1. The regularization
parameter $\lambda$ is chosen manually. Dictionary elements naturally organize in groups of patches, often with low
frequencies near the root of the tree, and high frequencies near the leaves. We also observe clear correlations
between each parent node and its children in the tree, where children often look like their parent, but sharper
and with small variations.

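¶This method is used in Ref. 46 for computing the proximal operator associated to hierarchical norms, and in the same
context as ours in Refs. 50 and 51.

‖The augmented Lagrangian is in fact the classical Lagrangian 52 of the following optimization problem which is
equivalent to Eq. (10):

$$\min f(\boldsymbol{\alpha}) + \sum_{g \in \mathcal{G}} \left[\lambda\eta_g\|\boldsymbol{\beta}^g\| + \frac{\gamma}{2}\|\boldsymbol{\beta}^g - \boldsymbol{\alpha}_g\|_2^2\right] \quad \text{s.t. } \forall g \in \mathcal{G},\ \boldsymbol{\beta}^g = \boldsymbol{\alpha}_g.$$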



This is of course a simple visual interpretation, which is intriguing, but which does not show that such a 
hierarchical dictionary can be useful for solving real problems. Some quantitative results can however be found 
in Ref. 11, with an inpainting experiment of natural image patches.** The conclusion of this experiment is that 
to reconstruct individual patches, hierarchical structures are helpful when there is a significant amount of noise. 




Figure 4. Learned dictionary with a tree structure of depth 4. The root of the tree is in the middle of the figure. The 
branching factors at depths 1,2,3 are respectively 10, 2, 2. The dictionary is learned on 50,000 patches of size 16 x 16 
pixels. Image taken from Ref. 11. 

4.2 Topographic Dictionary Learning 

In this experiment, we consider a database of $n = 100\,000$ natural image patches of size $m = 12 \times 12$ pixels, for
dictionaries of size $p = 400$. As done in the context of independent component analysis (ICA), 21 the dictionary
elements are arranged on a two-dimensional grid, and we consider spatial dependencies between them. When
learned on whitened natural image patches, this model, called topographic ICA, exhibits "Gabor-like" functions
smoothly organized on the grid, which the authors call a topographic map. As shown in Ref. 20, such a result
can be reproduced with a dictionary learning formulation, using a structured norm for $\psi$. Following their
formulation, we organize the $p$ dictionary elements on a $\sqrt{p} \times \sqrt{p}$ grid, and consider $p$ overlapping groups that
are $3 \times 3$ or $4 \times 4$ spatial neighborhoods on the grid (to avoid boundary effects, we assume the grid to be cyclic).
We define $\psi$ as a sum of $\ell_2$-norms over these groups, since the $\ell_\infty$-norm has proven to be less adapted for this
task. Another formulation achieving a similar effect was also proposed in Ref. 54 in the context of sparse coding
with a probabilistic model.

As presented in Section 2.1, we consider a projected stochastic gradient descent algorithm for learning $\mathbf{D}$:
at iteration $t$, we randomly draw one signal $\mathbf{y}^t$ from the database $\mathbf{Y}$, compute a sparse code $\boldsymbol{\alpha}^t$ which
is a solution of Eq. (1), and use the update rule of Eq. (6). In practice, to further improve the performance,
we use a mini-batch, drawing 500 signals at each iteration instead of one. 18 This approach mainly differs from
Ref. 20 in the way the sparse codes $\boldsymbol{\alpha}^t$ are obtained. Whereas Ref. 20 uses a subgradient descent algorithm to
compute them, we use the augmented Lagrangian techniques presented in Section 3.2. The natural image patches

**In this experiment, the patches do not overlap. Thus, this experiment does not study the reconstruction of
full images, where the patches usually overlap. 17,26




Figure 5. Learned dictionary with a tree structure of depth 5. The root of the tree is in the middle of the figure. The
branching factors at depths 1, 2, 3, 4 are respectively 10, 2, 2, 2. The dictionary is learned on 50,000 patches of size 16 x 16
pixels. Image taken from Ref. 11.

we use are also preprocessed: they are first centered by removing their mean value, called the DC component in the
image processing literature, and whitened, as often done in the literature. 21, 54 The parameter $\lambda$ is chosen such
that on average $\|\mathbf{y}^t - \mathbf{D}\boldsymbol{\alpha}^t\|_2 \approx 0.4\|\mathbf{y}^t\|_2$ for every new patch considered by the algorithm, which yields visually
interesting dictionaries. Examples of obtained results are shown in Figures 6 and 7, and exhibit similarities with
the maps of topographic ICA. 21

5. CONCLUSION 

We have presented in this paper different convex penalties inducing both sparsity and a particular structure in
the solution of an inverse problem. Whereas their most natural application is to model the structure of non-zero
patterns of parameter vectors of a problem, associated for instance with physical constraints in bio-informatics
or neuroscience, they also constitute a natural framework for learning structured dictionaries. We observe for
instance that, given an arbitrary structure, the dictionary elements can self-organize to adapt to the structure.
The results obtained when applying these methods to natural image patches are intriguing, similar to the ones
produced by topographic ICA. 21
Figure 6. Topographic dictionaries with 400 elements, learned on a database of 12 x 12 whitened natural image patches
with 3 x 3 cyclic overlapping groups. Image taken from Ref. 13.
Figure 7. Topographic dictionaries with 400 elements, learned on a database of 12 x 12 whitened natural image patches
with 4 x 4 cyclic overlapping groups. Image taken from Ref. 13.